This report explores a dataset of 4898 white wine observations, each with a quality score and a set of chemical attributes. I analyze how the quality of white wines is related to these attributes, and I also explore whether the attributes are related to one another.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The dataset consists of 4898 observations of 13 variables. The first column, X, is a row index.
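The structure and summary output above can be produced along the following lines. This is a minimal sketch, not the report's original code: the three rows are copied from the printed output, only a few columns are included, and the file name in the comment is an assumption.

```r
# Three rows copied from the str() output above. The real analysis would
# read the full CSV instead, e.g. read.csv("wineQualityWhites.csv")
# (the file name is an assumption).
csv_text <- "X,fixed.acidity,residual.sugar,density,alcohol,quality
1,7.0,20.7,1.0010,8.8,6
2,6.3,1.6,0.9940,9.5,6
3,8.1,6.9,0.9951,10.1,6"

wq <- read.csv(text = csv_text)

dim(wq)      # number of rows and columns
str(wq)      # type of each column
summary(wq)  # min, quartiles, mean and max for each column
```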
First, I want to see the distribution of white wine quality.
I convert the numeric variable ‘quality’ to a factor.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
From this table, we can see that most wines have quality 5, 6, or 7, and that no wine is rated 0, 1, 2, or 10.
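As a sketch of the factor conversion and the tabulation, the quality vector can be rebuilt from the counts printed above. The variable names are illustrative, and making the factor ordered is an assumption rather than something the report states.

```r
# Counts per quality level, copied from the table above
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Rebuild an integer quality vector with those frequencies
quality_int <- rep(as.integer(names(counts)), times = counts)

# Convert to an ordered factor (the ordering is an assumption)
quality <- factor(quality_int, levels = 3:9, ordered = TRUE)

quality_table <- table(quality)
quality_table

# Share of wines rated 5, 6 or 7
share_5_to_7 <- sum(quality_table[c("5", "6", "7")]) / length(quality)
round(share_5_to_7, 3)
```

With these counts, wines rated 5 to 7 make up 4535 of the 4898 observations.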
Next, I am curious how the different ingredients in white wines affect quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
This plot shows the distribution of fixed.acidity for white wines. In the histogram, most wines have fixed acidity between 5.8 g/dm^3 and 7.8 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
This plot shows the distribution of residual.sugar for white wines. The histogram is slightly bimodal on the log10 scale: most wines have residual.sugar around 1.5 g/dm^3 or between 8 g/dm^3 and 12 g/dm^3. The summary suggests there are some outliers, since the maximum (65.8) is far above the third quartile (9.9).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
This plot shows the distribution of chlorides for white wines. In the histogram, most wines have chlorides between 0.03 g/dm^3 and 0.06 g/dm^3. The summary suggests there are some outliers, since the maximum (0.346) is far above the third quartile (0.050).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
This plot shows the distribution of total.sulfur.dioxide for white wines. In the histogram, most wines have total.sulfur.dioxide between 100 mg/dm^3 and 160 mg/dm^3, and the distribution is close to normal. From the summary, we find there are a few outliers. The mean is 138.4 and the median is 134.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
This plot shows the distribution of pH for white wines. In the histogram, most wines have a pH between 3.0 and 3.3, and the distribution is close to normal.
From the summary, the mean is 3.188 and the median is 3.180.
Then, we will analyze the density distribution of white wines.
summary(wq$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
From this summary, we can calculate IQR = 0.9961 - 0.9917 = 0.0044, so 1.5 × IQR = 0.0066. The upper fence is 0.9961 + 0.0066 = 1.0027 and the lower fence is 0.9917 - 0.0066 = 0.9851.
The histogram shows that the density distribution is almost normal. From the summary, we find there are some outliers for density: the maximum (1.0390) lies above the upper fence, while the minimum (0.9871) is still inside the lower fence.
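The fence arithmetic can be checked directly from the printed quartiles. This is a small sketch that uses only the numbers in the summary above. The variable names are illustrative.

```r
# Values copied from summary(wq$density) above
q1   <- 0.9917
q3   <- 0.9961
dmin <- 0.9871
dmax <- 1.0390

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # 0.9917 - 0.0066
upper_fence <- q3 + 1.5 * iqr   # 0.9961 + 0.0066

round(c(lower = lower_fence, upper = upper_fence), 4)

# Is the most extreme value on each side beyond its fence?
dmin < lower_fence   # FALSE: no outliers on the low side
dmax > upper_fence   # TRUE: at least one outlier on the high side
```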
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
This plot shows the distribution of alcohol percentage for white wines. In the histogram, alcohol percentage is spread more evenly across its range than the other attributes. Even though the median and mean of alcohol are close, the distribution is not normal.
Next, I am curious about the sweetness level of white wines, so I will create a new variable ‘sweetness’ for further analysis.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## 7 7 6.2 0.32 0.16 7.0 0.045
## 8 8 7.0 0.27 0.36 20.7 0.045
## 9 9 6.3 0.30 0.34 1.6 0.049
## 10 10 8.1 0.22 0.43 1.5 0.044
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## 7 30 136 0.9949 3.18 0.47 9.6
## 8 45 170 1.0010 3.00 0.45 8.8
## 9 14 132 0.9940 3.30 0.49 9.5
## 10 28 129 0.9938 3.22 0.45 11.0
## quality sweetness
## 1 6 medium
## 2 6 dry
## 3 6 medium dry
## 4 6 medium dry
## 5 6 medium dry
## 6 6 medium dry
## 7 6 medium dry
## 8 6 medium
## 9 6 dry
## 10 6 dry
## dry medium dry medium sweet
## 2410 1662 825 1
From this plot, we can see the distribution of the sweetness level of white wines. Most white wines are dry or medium dry; together these account for 83% of the white wines in this dataset.
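One way to derive such a variable is `cut()` on residual.sugar. The cut points below (4, 12 and 45 g/dm^3) are an assumption based on common EU labelling terms, because the report does not state the thresholds it used. They do reproduce the ten labels printed above.

```r
# Thresholds in g/dm^3 are an assumption; the report does not list them.
sweetness_breaks <- c(0, 4, 12, 45, Inf)
sweetness_labels <- c("dry", "medium dry", "medium", "sweet")

# residual.sugar for the first ten wines shown above
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# Intervals are closed on the right by default: (0,4], (4,12], (12,45], (45,Inf]
sweetness <- cut(residual_sugar,
                 breaks = sweetness_breaks,
                 labels = sweetness_labels)
as.character(sweetness)

# Share of dry and medium dry wines, from the counts printed above
sweetness_counts <- c(dry = 2410, `medium dry` = 1662, medium = 825, sweet = 1)
dry_share <- sum(sweetness_counts[c("dry", "medium dry")]) / sum(sweetness_counts)
round(100 * dry_share)   # percentage of dry plus medium dry wines
```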
There are 4898 observations in this dataset. Excluding the row index X and including the derived sweetness, there are 13 variables: “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”, and “sweetness”.
I convert the variable quality into a factor; sweetness is categorical, and all other variables are numeric.
worst —> best quality 0,1,2,3,4,5,6,7,8,9,10
Other observations:
1. The most common quality is 6.
2. Many wines have fixed acidity between 5.8 g/dm^3 and 7.8 g/dm^3.
3. The distributions of residual.sugar and chlorides both have a long tail and some outliers. Many white wines have residual.sugar around 1.5 g/dm^3.
4. The distributions of total.sulfur.dioxide and pH are close to normal.
5. Alcohol percentage is spread more evenly than the other attributes and is not normally distributed.
6. Most white wines are dry or medium dry; together these account for 83% of the white wines in this dataset.
The main interest in this dataset is how different ingredients affect white wine quality. The purpose of this project is to analyze how quality relates to the following ingredients: fixed.acidity, residual.sugar, chlorides, total.sulfur.dioxide, pH, and alcohol.
The sweetness feature will also help us investigate quality, because the sweetness level affects the taste of a white wine and therefore influences how the experts grade it.
I created a variable, sweetness, which represents the sweetness level of white wines. It has four levels: dry, medium dry, medium, and sweet.
For the quality variable, the observed range is 3 to 9. No wine is rated 0, 1, 2, or 10. I think this is because the grades are given by experts and are therefore subjective: they tend not to give very low grades or a full grade.
Alcohol percentage is spread more evenly than the other attributes. Even though the median and mean of alcohol are close, the distribution is not normal.
For the sweetness levels, only one wine in this dataset falls in the sweet category. Maybe this is because white wines are not supposed to be very sweet.
First, I want to explore the correlation coefficients between the variables.
## fixed.acidity residual.sugar chloride
## fixed.acidity 1.00000000 0.08902070 0.02308564
## residual.sugar 0.08902070 1.00000000 0.08868454
## chloride 0.02308564 0.08868454 1.00000000
## total.sulfur.dioxide 0.09106976 0.40143931 0.19891030
## density 0.26533101 0.83896645 0.25721132
## pH -0.42585829 -0.19413345 -0.09043946
## alcohol -0.12088112 -0.45063122 -0.36018871
## quality -0.11366283 -0.09757683 -0.20993441
## sweetness 0.08462378 0.93375334 0.09117180
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.425858291
## residual.sugar 0.401439311 0.83896645 -0.194133454
## chloride 0.198910300 0.25721132 -0.090439456
## total.sulfur.dioxide 1.000000000 0.52988132 0.002320972
## density 0.529881324 1.00000000 -0.093591493
## pH 0.002320972 -0.09359149 1.000000000
## alcohol -0.448892102 -0.78013762 0.121432099
## quality -0.174737218 -0.30712331 0.099427246
## sweetness 0.407802136 0.78928904 -0.186400065
## alcohol quality sweetness
## fixed.acidity -0.1208811 -0.11366283 0.08462378
## residual.sugar -0.4506312 -0.09757683 0.93375334
## chloride -0.3601887 -0.20993441 0.09117180
## total.sulfur.dioxide -0.4488921 -0.17473722 0.40780214
## density -0.7801376 -0.30712331 0.78928904
## pH 0.1214321 0.09942725 -0.18640006
## alcohol 1.0000000 0.43557472 -0.46255482
## quality 0.4355747 1.00000000 -0.10204075
## sweetness -0.4625548 -0.10204075 1.00000000
To see the correlation coefficients between all variables, I create a data frame M and plot M as a correlation matrix.
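A minimal sketch of how such a matrix can be computed with `cor()` follows. It uses only the ten rows printed earlier and three numeric columns, so the coefficients differ from the full-data values above. Coding sweetness as the integers 1 to 4 is an assumption about how the report made it numeric.

```r
# First ten rows as printed earlier (subset of columns)
wq_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956,
                     0.9951, 0.9949, 1.0010, 0.9940, 0.9938),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0),
  sweetness      = factor(
    c("medium", "dry", "medium dry", "medium dry", "medium dry",
      "medium dry", "medium dry", "medium", "dry", "dry"),
    levels = c("dry", "medium dry", "medium", "sweet"))
)

# cor() needs numeric input, so replace the factor by its level codes
# (1 = dry ... 4 = sweet); this coding is an assumption.
wq_head$sweetness <- as.integer(wq_head$sweetness)

M <- cor(wq_head)   # Pearson correlation matrix
round(M, 2)
```

Quality is left out of this small example because it is constant (6) in the ten rows shown, which would give an undefined correlation.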
I create this plot filled by quality to see how quality is distributed across fixed.acidity. For most quality levels, fixed.acidity is roughly normally distributed.
From these two plots, we can see that quality 3 has the widest interquartile range in fixed.acidity, almost twice that of the other levels, while quality 9 has a very narrow one, only about half that of the others.
## wq$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.575 7.300 7.600 8.525 11.800
## --------------------------------------------------------
## wq$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800 6.400 6.900 7.129 7.600 10.200
## --------------------------------------------------------
## wq$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 6.400 6.800 6.934 7.400 10.300
## --------------------------------------------------------
## wq$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
## --------------------------------------------------------
## wq$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.200 6.700 6.735 7.200 9.200
## --------------------------------------------------------
## wq$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.800 6.657 7.300 8.200
## --------------------------------------------------------
## wq$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.60 6.90 7.10 7.42 7.40 9.10
This table confirms what we found in the two plots. Most quality levels have a similar spread (an interquartile range of about 1.0 to 1.2), while quality 3 has the widest interquartile range (1.95) and quality 9 the narrowest (0.5).
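The spread per quality level can be computed from the printed quartiles, which also separates the interquartile range from the full range. This sketch uses only the numbers in the table above. The column names are illustrative.

```r
# fixed.acidity summary per quality level, copied from the table above
fa <- data.frame(
  quality = 3:9,
  lo = c(4.2, 4.8, 4.5, 3.8, 4.2, 3.9, 6.6),          # Min.
  q1 = c(6.575, 6.400, 6.400, 6.300, 6.200, 6.200, 6.900),
  q3 = c(8.525, 7.600, 7.400, 7.300, 7.200, 7.300, 7.400),
  hi = c(11.8, 10.2, 10.3, 14.2, 9.2, 8.2, 9.1)       # Max.
)

fa$iqr        <- fa$q3 - fa$q1   # width of the box in a boxplot
fa$full_range <- fa$hi - fa$lo   # distance from minimum to maximum

fa[, c("quality", "iqr", "full_range")]

fa$quality[which.max(fa$iqr)]         # level with the widest IQR
fa$quality[which.min(fa$iqr)]         # level with the narrowest IQR
fa$quality[which.max(fa$full_range)]  # level with the widest full range
```

Quality 3 has the widest interquartile range and quality 9 the narrowest, while the widest full range belongs to quality 6 because of its single maximum at 14.2.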
This plot displays the distribution of residual.sugar for each quality level. Each distribution is somewhat right-skewed.
This plot displays the distribution of chlorides for each quality level. On the log10 scale, most of them are roughly normal.
This plot shows that the distribution of total.sulfur.dioxide is almost normal for every quality level.
This stacked plot of density by quality shows that density is roughly normally distributed within each quality level.
From this boxplot of density for each quality level, we find that the highest quality probably has the lowest density: quality 9 has a lower density than any other level.
## wq$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
## --------------------------------------------------------
## wq$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
## --------------------------------------------------------
## wq$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
## --------------------------------------------------------
## wq$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## wq$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## wq$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
## --------------------------------------------------------
## wq$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
From this table, we may speculate that better-quality white wines tend to have relatively lower density.
This plot shows that the distribution of pH is roughly normal for every quality level.
This plot displays the distribution of alcohol for each quality level, and it differs somewhat from the other attributes. Most quality levels are roughly normally distributed between 8% and 12% alcohol, while almost all white wines between 12% and 14% alcohol have quality above 5.
## wq$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.34 11.00 12.60
## --------------------------------------------------------
## wq$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wq$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wq$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wq$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wq$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wq$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
From the boxplot and statistics above, we may infer that high-quality white wines usually have a high alcohol percentage.
The histogram of alcohol percentage for each quality level shows more clearly that most quality levels are roughly normally distributed between 8% and 12% alcohol. However, the quality levels above 5 also extend into the 12% to 14% range. This is probably because good-quality white wines have a higher alcohol percentage.
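A quick descriptive check of this trend is to compare the median alcohol across quality levels. The sketch below uses only the medians printed above. The Spearman rank correlation on seven group medians is illustrative and is not a formal test.

```r
# Median alcohol per quality level, copied from the summaries above
alcohol_by_quality <- data.frame(
  quality = 3:9,
  median  = c(10.45, 10.10, 9.50, 10.50, 11.40, 12.00, 12.50)
)

# Rank correlation between quality level and median alcohol
rho <- cor(alcohol_by_quality$quality, alcohol_by_quality$median,
           method = "spearman")
round(rho, 3)

# From quality 5 upwards, is each median higher than the previous one?
upper <- alcohol_by_quality$median[alcohol_by_quality$quality >= 5]
all(diff(upper) > 0)
```

The medians fall slightly from quality 3 to 5 and then rise steadily from 5 to 9. Quality 3 and quality 9 contain only 20 and 5 wines, so their medians are less reliable.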
## wq$quality: 3
## dry medium dry medium sweet
## 11 6 3 0
## --------------------------------------------------------
## wq$quality: 4
## dry medium dry medium sweet
## 104 47 12 0
## --------------------------------------------------------
## wq$quality: 5
## dry medium dry medium sweet
## 578 565 314 0
## --------------------------------------------------------
## wq$quality: 6
## dry medium dry medium sweet
## 1063 766 368 1
## --------------------------------------------------------
## wq$quality: 7
## dry medium dry medium sweet
## 551 225 104 0
## --------------------------------------------------------
## wq$quality: 8
## dry medium dry medium sweet
## 99 52 24 0
## --------------------------------------------------------
## wq$quality: 9
## dry medium dry medium sweet
## 4 1 0 0
From the plot and statistics above, the number of wines at each quality level decreases as the sweetness level goes from dry to sweet. We may speculate that most white wines are dry or medium dry.
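The counts above are easier to compare as shares within each quality level. The sketch below rebuilds the table from the printed numbers and uses `prop.table()`. The object names are illustrative.

```r
# Counts copied from the per-quality tables above (one row per quality level)
sweet_by_quality <- matrix(
  c(  11,   6,   3, 0,
     104,  47,  12, 0,
     578, 565, 314, 0,
    1063, 766, 368, 1,
     551, 225, 104, 0,
      99,  52,  24, 0,
       4,   1,   0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(quality   = as.character(3:9),
                  sweetness = c("dry", "medium dry", "medium", "sweet"))
)

# Share of each sweetness level within a quality level (rows sum to 1)
row_share <- prop.table(sweet_by_quality, margin = 1)

# Combined share of dry and medium dry wines per quality level, in percent
dry_or_medium_dry <- rowSums(row_share[, c("dry", "medium dry")])
round(100 * dry_or_medium_dry, 1)
```

At every quality level, dry and medium dry wines together make up more than three quarters of the wines.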
I create this scatterplot between residual.sugar and density with a linear smoother. It shows that residual.sugar has a strong positive linear correlation with density (r = 0.84): wines with higher residual.sugar tend to have higher density.
I create this scatterplot between alcohol and density with a linear smoother. It shows that alcohol has a strong negative linear correlation with density (r = -0.78): wines with a higher alcohol percentage tend to have lower density.
These two plots both show that the density of white wines increases as the sweetness level increases. This also agrees with the relationship between density and residual.sugar analyzed above.
I create this scatterplot between total.sulfur.dioxide and density with a linear smoother. It shows that total.sulfur.dioxide has a moderate positive linear correlation with density (r = 0.53). Density tends to increase somewhat as total.sulfur.dioxide increases.
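For a straight-line fit with a single predictor, the share of variance explained equals the squared correlation coefficient. The sketch below applies that to the three coefficients with density taken from the correlation matrix above.

```r
# Correlation of each attribute with density, copied from the matrix above
r_with_density <- c(residual.sugar       =  0.83896645,
                    alcohol              = -0.78013762,
                    total.sulfur.dioxide =  0.52988132)

# R-squared of a simple linear regression of density on each attribute
r_squared <- r_with_density^2
round(r_squared, 2)
```

On its own, residual.sugar accounts for about 70% of the variance in density, alcohol for about 61%, and total.sulfur.dioxide for about 28%. These shares overlap, because the three attributes are correlated with each other.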
The quality of white wines is not strongly affected by these attributes.
The absolute correlation coefficients between quality and these attributes are mostly under 0.3. Only density (-0.31) and alcohol (0.44) exceed that, which indicates that quality is only weakly to moderately related to density and alcohol.
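The ranking can be read off by sorting the quality column of the correlation matrix by absolute value. This sketch uses the coefficients printed earlier. The 0.3 cutoff is the one used in the text.

```r
# Correlation of each attribute with quality, copied from the matrix above
cor_with_quality <- c(fixed.acidity        = -0.11366283,
                      residual.sugar       = -0.09757683,
                      chlorides            = -0.20993441,
                      total.sulfur.dioxide = -0.17473722,
                      density              = -0.30712331,
                      pH                   =  0.09942725,
                      alcohol              =  0.43557472,
                      sweetness            = -0.10204075)

# Sort by strength of association, ignoring the sign
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 2)

# Attributes whose absolute correlation with quality is at least 0.3
names(ranked)[abs(ranked) >= 0.3]
```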
The density of white wines has a linear relationship with some other attributes. It has a strong linear correlation with residual.sugar and alcohol: density goes up when residual.sugar goes up, and goes down when the alcohol percentage goes up. Density has a moderate relationship with total.sulfur.dioxide.
Since the number of wines at every quality level decreases from dry to sweet, we may infer that most white wines are dry or medium dry.
Although the quality of white wines is not strongly affected by these attributes, the density of white wines has a linear correlation with residual.sugar, alcohol, and total.sulfur.dioxide.
The density of white wines goes up when residual.sugar goes up, and goes down when the alcohol percentage goes up.
The strongest relationship is between residual.sugar and sweetness. That is because I define the sweetness level purely from residual.sugar.
In this scatterplot colored by quality, we can see that density increases gradually as residual.sugar goes up.
At the same value of residual.sugar, the colors for quality 8 and 9 are mostly at the bottom while the colors for low quality are above them. That may be because higher-quality white wines have relatively lower density.
Moreover, the plot shows that many points at each quality level lie on the left side of the x axis. That suggests most white wines have low residual.sugar, whatever their quality.
This scatterplot shows that density and total.sulfur.dioxide have a moderate linear correlation. For each quality level, most of the points are around the center of the x axis, with fewer and fewer toward either side. That indicates total.sulfur.dioxide is roughly normally distributed within every quality level.
In this scatterplot colored by quality, we can see that density decreases gradually as the alcohol percentage goes up.
Along the x axis, when the alcohol percentage is more than 12, there are very few wines of quality 3 or quality 4.
Together, these observations suggest that better-quality white wines have a higher alcohol percentage and a lower density.
This picture shows more visually what we explored before about density, sweetness, and quality. The density of white wines increases as the sweetness level goes up. At the same time, the number of wines at each quality level decreases from dry to sweet, which means white wines of every quality are usually dry or medium dry. Most of the blue and purple points, which represent quality 7 and 8, are on the left side of the x axis, at lower density.
This picture shows that density and residual.sugar do have a strong relationship. Better-quality white wines have relatively lower density, and drier white wines usually have lower density as well. But sugar does not influence quality that much.
This picture shows that density and total.sulfur.dioxide have a moderate relationship. But total.sulfur.dioxide does not influence quality that much.
This picture shows that density and alcohol do have a strong relationship. Moreover, higher-quality white wines tend to have lower density and a higher alcohol percentage. However, quality is not affected by sweetness.
##
## Call:
## lm(formula = I(as.numeric(quality)) ~ alcohol + chlorides + citric.acid +
## density + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity, data = wq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.482e+02 1.880e+01 7.881 3.98e-15 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
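The model call above wraps quality in `as.numeric()`. If quality was still a factor at that point (an assumption, based on the earlier conversion), `as.numeric()` returns the internal level codes rather than the original grades. The toy vector below illustrates the difference.

```r
# Toy example: a factor holding quality grades between 3 and 9
quality_factor <- factor(c(3, 5, 9, 6, 6, 4, 7, 8))

codes  <- as.numeric(quality_factor)                # level codes 1..7
values <- as.numeric(as.character(quality_factor))  # original grades 3..9

rbind(codes = codes, values = values)

# With every level from 3 to 9 present, each code is the grade minus 2
all(values - codes == 2)
```

Under that assumption, the slope estimates would be the same as for a model on the original grades, and only the intercept would be shifted by 2. Converting with `as.numeric(as.character(...))` avoids the ambiguity.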
In this linear model, R-squared is only 0.2819, which means the fit is not very good: only around 28% of the variance in white wine quality is explained by these attributes. This agrees with what we found before, that the quality of white wines is not strongly affected by these attributes.
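The adjusted R-squared and the F-statistic in the output follow from R-squared, the number of observations, and the number of predictors. The sketch below redoes that arithmetic from the printed values, so small differences come from rounding R-squared to four digits.

```r
# Values read from the model summary above
r_squared <- 0.2819
n_obs     <- 4898   # observations
n_pred    <- 11     # predictors, excluding the intercept

df_resid <- n_obs - n_pred - 1   # residual degrees of freedom

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r_squared / n_pred) / ((1 - r_squared) / df_resid)

round(c(df_resid = df_resid,
        adj_r_squared = adj_r_squared,
        f_stat = f_stat), 4)
```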
Secondly, the significance stars represent significance levels, with the number of asterisks determined by the p-value. Three stars (***) mean high significance, while one star means low significance. In this case, alcohol, density, free.sulfur.dioxide, pH, residual.sugar, sulphates, and volatile.acidity have three stars, indicating that it is unlikely that no relationship exists between quality and these variables.
# Multivariate Analysis
There is no strong relationship between the quality of white wines and these attributes in the dataset. Only density has a clear linear relationship, with sugar and alcohol.
There are no surprising interactions between features in this dataset.
I built a linear model of quality on the other attributes of white wines. I found that quality is not strongly related to these attributes, which means the linear model does not predict quality very well from them.
A limitation of this model is that the data come from just one manufacturer, so we do not know how well it applies to white wines from other producers.
From this plot, we can see that the distribution of white wine quality is almost normal. Most wines have quality 5, 6, or 7, and no wine is rated 0, 1, 2, or 10. I think this is because the grade is given by experts who taste the white wines, so the grades are subjective. Maybe the experts do not want to give very low grades or a full grade.
In this scatterplot of density against residual.sugar, colored by quality, we can see that density increases gradually as residual.sugar goes up.
It also shows that, at the same value of residual.sugar, the colors for quality 8 and 9 are mostly at the bottom while the colors for low quality are above them. That may be because higher-quality white wines have relatively lower density.
Furthermore, the plot shows that many points at each quality level lie on the left side of the x axis. That indicates most white wines have low residual.sugar, whatever their quality.
Firstly, the stacked plot of density by quality shows that density is close to normally distributed within each quality level.
Secondly, the scatterplot of density against alcohol, colored by quality, shows the relationship among these variables more clearly. We can see that density decreases gradually as the alcohol percentage goes up.
What is more, the distribution of colors shows that when the alcohol percentage is more than 12, there are very few white wines of quality 3 or quality 4.
Together, these plots suggest that better-quality white wines have a higher alcohol percentage and a lower density.
In summary, the quality of white wines is not strongly influenced by the other attributes, and quality does not correlate much with them. We can only speculate that higher-quality white wines may have lower density and a higher alcohol percentage. Most wines have medium quality, with a grade of 5, 6, or 7.
A surprising thing is that density has a strong linear correlation with residual.sugar and alcohol. However, in the correlation matrix above, the correlations between pairs that involve neither density nor the derived sweetness variable are all weaker: their coefficients are below 0.5 in absolute value.
When I was working on this dataset, I found that the challenge of exploratory data analysis is deciding which plot gives the clearest visualization and figuring out the relationships among many variables. Also, when we explore a specific dataset, we need some background knowledge about the data, so that we can interpret it better in its particular context.
For this white wine data, the limitation is that the data come from only one manufacturer. With data from a variety of manufacturers, the analysis of the attributes of white wines could be more confident and persuasive.
Sweetness of wine https://en.wikipedia.org/wiki/Sweetness_of_wine
Analysis of White Wine Quality https://rstudio-pubs-static.s3.amazonaws.com/249236_218e87eee0b94a05acec856159875cd5.html
White Wine Quality Exploration by Swain Tseng https://rpubs.com/Swain/205356
Diamonds Exploration by Chris Saden https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
Fitting & Interpreting Linear Models in R http://blog.yhat.com/posts/r-lm-summary.html